Video games recommendation system¶
The aim of this notebook is to create a recommendation system that suggests products similar to the one the user chose.
The iterations of the project will be kept in a git repository.
The project is created and worked on by Mihail Kenarov.
Introduction¶
At the beginning of the semester we were introduced to the first steps of what AI is about and how it works. At the time we were given the opportunity to create a project of our liking, with the first submission being known as Iteration 0, where we received feedback on our initial ideas. For that submission I had only selected a dataset, which was full of empty values that I did not know what to do with, and mentioned the idea of using a kNN model, because to me it seemed like a matter of classification.
For the previous iteration (Iteration 1) of the notebook, I created a recommendation system that took the descriptions of the games and put them through the TF*IDF (Term Frequency-Inverse Document Frequency) model. After that, the sigmoid kernel was used to compare all of the vectorized games by mapping their similarity into a range between 0 and 1. Finally, it printed out the top 5 games closest to the one we selected.
- link for understanding TF*IDF (https://www.youtube.com/watch?v=D2V1okCEsiE&ab_channel=KrishNaik)
For this iteration (Iteration 2) I am going to use another model and implement more data cleaning as well as more preprocessing. While doing the modeling I will try to implement graphs which will give me a better representation of what is currently going on with the system. Finally, I will write a conclusion on the differences between the two attempts.
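To make the Iteration 1 idea concrete, here is a minimal sketch of a TF*IDF plus sigmoid kernel recommender. It is not the original Iteration 1 code: the toy games, their descriptions and the `recommend` helper are made up for illustration, and only `TfidfVectorizer` and `sigmoid_kernel` come from scikit-learn.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Hypothetical toy data standing in for the real game descriptions
toy_games = pd.DataFrame({
    "name": ["Space Shooter", "Galaxy Blaster", "Farm Life", "Harvest Town"],
    "description": [
        "fast space shooter with alien ships",
        "arcade space shooter against alien invaders",
        "relaxing farm simulator with crops and animals",
        "farm simulator about crops and village life",
    ],
})

# Turn every description into a TF*IDF vector
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(toy_games["description"])

# Pairwise similarity between all games
similarity = sigmoid_kernel(tfidf_matrix, tfidf_matrix)


def recommend(name, top_n=2):
    """Return the names of the top_n games most similar to `name` (hypothetical helper)."""
    idx = toy_games.index[toy_games["name"] == name][0]
    scores = pd.Series(similarity[idx], index=toy_games["name"]).drop(name)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()


print(recommend("Space Shooter", top_n=1))
```

Games that share words in their descriptions end up with a higher kernel value, so they are ranked first.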
Importing libraries¶
import sklearn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
print("scikit-learn version:", sklearn.__version__) # 1.4.1
print("pandas version:", pd.__version__) # 2.2.1
print("seaborn version:", sns.__version__) # 0.13.2
scikit-learn version: 1.4.1.post1
pandas version: 2.2.2
seaborn version: 0.13.2
Phase 1¶
Domain Understanding¶
There are many things that should be understood when it comes to the creation of a recommendation system, but let us have a quick look into the basis of the subject.
What are recommendation systems¶
A recommendation system, also known as a recommender system, is a subset of machine learning. It leverages data to assist in predicting, refining, and identifying what individuals are seeking amidst an exponentially expanding array of choices.
What types of recommendation systems are there¶
- Collaborative filtering - it is based on the user data and the item data that we have. It comes in 2 different categories:
User - User Collaborative filtering - User-Based Collaborative Filtering is a method employed to anticipate the preferences of a user by considering the ratings provided to items by other users with similar tastes to the target user. This technique is commonly utilized by numerous websites to construct their recommendation systems.
Item - Item Collaborative filtering - Rather than matching the user to similar customers, item-to-item collaborative filtering matches each of the user’s purchased and rated items to similar items, then combines those similar items into a recommendation list.
Content based filtering - it is based mainly on the data that we have about the items that users interact with. For example, if they buy a certain book, other books with a similar genre, author, type etc. can be recommended.
Hybrid Recommendation system - This kind of system combines the two approaches mentioned above. It merges the inputs from collaborative filtering and content based filtering, so that the overall accuracy is better. It also reduces problems such as the 'cold start', where the lack of data is a problem at the beginning, right after deployment.
Links for more detailed information:
User - User Collaborative filtering https://www.geeksforgeeks.org/user-based-collaborative-filtering/
Item - Item Collaborative filtering https://www.geeksforgeeks.org/item-to-item-based-collaborative-filtering/?ref=ml_lbp
Types of Recommendation systems https://marutitech.medium.com/what-are-the-types-of-recommendation-systems-3487cbafa7c9
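As a small illustration of item-item collaborative filtering, the sketch below compares games by the ratings they received. The ratings table and the `similar_items` helper are hypothetical, and treating 0 as "not rated" is a simplifying assumption.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows are users, columns are games, 0 means "not rated"
ratings = pd.DataFrame(
    {
        "Game A": [5, 4, 0, 1],
        "Game B": [4, 5, 0, 1],
        "Game C": [0, 1, 5, 4],
    },
    index=["user1", "user2", "user3", "user4"],
)

# Item-item: compare the columns (games) by the ratings they received
item_similarity = pd.DataFrame(
    cosine_similarity(ratings.T),
    index=ratings.columns,
    columns=ratings.columns,
)


def similar_items(game, top_n=1):
    """Return the top_n games rated most similarly to `game` (hypothetical helper)."""
    scores = item_similarity[game].drop(game)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()


print(similar_items("Game A"))
```

A user-user variant would apply the same similarity to the rows of the table instead of the columns.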
What are some of the more famous algorithms used for such systems¶
Matrix Factorisation - Matrix factorization represents a category of collaborative filtering techniques utilized in recommendation systems. These algorithms work by decomposing the user-item interaction matrix into the product of two lower-dimensional rectangular matrices.
Nearest Neighbors (kNN) - The simplest algorithm computes cosine or correlation similarity of rows (users) or columns (items) and recommends items that the k nearest neighbors enjoyed.
TF*IDF - Term Frequency-Inverse Document Frequency, abbreviated as TF-IDF, is a metric that quantifies the significance of a word within a document in a collection or corpus, taking into account the adjustment for words that generally appear more frequently. It has been commonly utilized as a weighting factor in information retrieval, text mining, and user modeling searches.
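The nearest neighbors idea can be sketched with scikit-learn's `NearestNeighbors`. The game names and genre vectors below are invented for illustration, and using binary genre columns as features is an assumption, not necessarily what the final model will use.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical games described by binary genre columns:
# Puzzle, Strategy, Shooter, Arcade
game_names = ["Puzzle Quest", "Brain Teaser", "Laser Arena", "Star Gunner"]
genre_vectors = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
])

# n_neighbors=2 because the closest neighbor of a game is the game itself
knn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
knn.fit(genre_vectors)


def nearest_games(name):
    """Return the nearest other game(s) to `name` (hypothetical helper)."""
    idx = game_names.index(name)
    distances, indices = knn.kneighbors(genre_vectors[idx:idx + 1])
    return [game_names[i] for i in indices[0] if i != idx]


print(nearest_games("Puzzle Quest"))
```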
What about some history of the recommendation systems¶
Can you guess which was the first recommendation system ever created? - It was you! Recommendation systems have been with us for as long as humans have existed. It started exactly from us, the humans, spreading general ideas while talking to friends, family, or people we just enjoy being with, about things we would say go well together or that we would like the people close to us to experience. These were the first recommendations that were ever given out and we still use them to this day.
With the evolution of technology, we eventually received the first recommendation system that was made by humans and operated on its own: “Grundy.” It was a system for the recommendation of books based on the users’ inputs. With time it started being criticized, as all things in our world are, especially in technology.
- More about 'Grundy' and the history of recommendation systems https://onespire.net/history-of-recommender-systems/
Want to know more about recommendation systems?¶
If you are interested in getting a deeper understanding of the questions we just discussed, or if you have a deep interest in the world of recommendation systems and want to know more about topics such as:
- Pros, Cons, Ethical Problems and Limitations of Recommendation systems
- Who is most affected by the usage of such systems and in what ways
- Where are they implemented and in what ways
Feel free to have a look into the Project Proposal that was attached to the submission, together with this notebook
Phase 2¶
Data Requirements:¶
Considering the fact that video games have been with us for quite a while now, there are some things that we should take into consideration when tackling the project at hand. The most important of these is the data we are going to select for this project. Knowing that people have different tastes, we will definitely need the genres. Not only that, but let us be honest, if we are going to create a recommendation system we will need the names. The developer might be helpful, and possibly the description. Some people will be interested in older games and some in newer ones, so it might not be a bad idea to have the date of release and possibly who released the game. One big part of recommendations are the ratings that are given out to most of the products, which in our case are the games.
For now what we know we would want:
- Name : Text
- Genre : Text
- Description : Text
- Developer : Text
- Rating : Number
- Date of release: Date Time format
There might be some other features that could be of use, but for now these are the main things that we are going to be looking for.
Data collection:¶
After looking into multiple places where one can gather data for such a project, I have decided to get this one from Kaggle
https://www.kaggle.com/datasets/gsimonx37/backloggd
It has been collected from this site: https://www.backloggd.com/, which to me seems like a good site created by fans of video games. It lists different games, their genres and ratings from users, and allows the users to express their opinions on certain games.
The data in it seems to fit the criteria of what we may need to use.
Let's have a closer look into the datasets we are provided¶
import sklearn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
games_df = pd.read_csv('games.csv')
devs_df = pd.read_csv('developers.csv')
genres_df = pd.read_csv('genres.csv')
platforms_df = pd.read_csv('platforms.csv')
scores_df = pd.read_csv('scores.csv')
Let's start one by one¶
Games dataframe
games_df.head()
| | id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | Cathode Ray Tube Amusement Device | 1947-12-31 | 3.5 | 65.0 | 117.0 | 1.0 | 28.0 | 56.0 | The cathode ray tube amusement device is the e... |
| 1 | 1000002 | Bertie the Brain | 1950-08-25 | 2.5 | 11.0 | 24.0 | 0.0 | 6.0 | 12.0 | Currently considered the first videogame in hi... |
| 2 | 1000003 | Nim | 1951-12-31 | 1.8 | 2.0 | 11.0 | 0.0 | 2.0 | 6.0 | The Nimrod was a special purpose computer that... |
| 3 | 1000004 | Draughts | 1952-08-31 | 2.4 | 3.0 | 17.0 | 0.0 | 3.0 | 7.0 | A game of draughts (a.k.a. checkers) written f... |
| 4 | 1000005 | OXO | 1952-12-31 | 3.1 | 14.0 | 52.0 | 1.0 | 12.0 | 13.0 | OXO was a computer game developed by Alexander... |
games_df.shape
(172512, 10)
Devs dataframe
devs_df.head()
| | id | developer |
|---|---|---|
| 0 | 1000002 | Josef Kates |
| 1 | 1000004 | Christopher Strachey |
| 2 | 1000005 | Alexander Shafto "Sandy" Douglas |
| 3 | 1000005 | University of Warwick |
| 4 | 1000007 | William Higinbotham |
devs_df.shape
(143454, 2)
Genres Dataframe
genres_df.head()
| | id | genre |
|---|---|---|
| 0 | 1000001 | Point-and-Click |
| 1 | 1000002 | Puzzle |
| 2 | 1000002 | Tactical |
| 3 | 1000003 | Pinball |
| 4 | 1000003 | Strategy |
genres_df.shape
(286025, 2)
Platforms Dataset
platforms_df.head()
| | id | platform |
|---|---|---|
| 0 | 1000001 | Analogue electronics |
| 1 | 1000002 | Arcade |
| 2 | 1000003 | Ferranti Nimrod Computer |
| 3 | 1000004 | Legacy Computer |
| 4 | 1000005 | Windows PC |
platforms_df.shape
(261475, 2)
Scores Dataset
scores_df.head(15)
| | id | score | amount |
|---|---|---|---|
| 0 | 1000001 | 0.5 | 10 |
| 1 | 1000001 | 1.0 | 5 |
| 2 | 1000001 | 1.5 | 1 |
| 3 | 1000001 | 2.0 | 3 |
| 4 | 1000001 | 2.5 | 9 |
| 5 | 1000001 | 3.0 | 10 |
| 6 | 1000001 | 3.5 | 2 |
| 7 | 1000001 | 4.0 | 2 |
| 8 | 1000001 | 4.5 | 3 |
| 9 | 1000001 | 5.0 | 41 |
| 10 | 1000002 | 0.5 | 0 |
| 11 | 1000002 | 1.0 | 3 |
| 12 | 1000002 | 1.5 | 0 |
| 13 | 1000002 | 2.0 | 4 |
| 14 | 1000002 | 2.5 | 2 |
scores_df.shape
(1725120, 3)
Data Understanding:¶
Here is what we can gather from the information¶
The games dataset has 172512 rows and 10 columns
The developers dataset has 143454 rows and 2 columns
The genres dataset has 286025 rows and 2 columns
The platforms dataset has 261475 rows and 2 columns
The scores dataset has 1725120 rows and 3 columns
From what we are shown, we can also create a dictionary that gives us a basis for what we are working with and a good overview of what data is contained in which columns
Data Dictionary¶
- Games Dataset - basic data:
- id - video game identifier (primary key);
- name - name of the video game;
- date - release date of the video game;
- rating - average rating of the video game;
- reviews - number of reviews;
- plays - total number of players;
- playing - number of players currently (at the time)
- backlogs - the number of additions of a video game to the backlog;
- wishlists - the number of times a video game has been added to “wishlist” (want to buy);
- description - description of the video game.
- Developers dataset - developers (publishers):
- id - video game identifier (foreign key);
- developer - developer (publisher) of a video game.
- Platforms dataset - platforms of the games:
- id - video game identifier (foreign key);
- platform - gaming platform.
- Genres dataset - game genres:
- id - video game identifier (foreign key);
- genre - video game genre.
- Scores dataset - user ratings:
- id - video game identifier (foreign key);
- score - score (from 0.5 to 5 in increments of 0.5);
- amount - number of users that gave this score
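The primary key / foreign key relationship described in the dictionary can be checked directly. The sketch below uses two tiny hypothetical frames (`games` and `genres`) in place of `games_df` and `genres_df`; the same three expressions should work on the real data, under the assumption that `id` is the join column everywhere.

```python
import pandas as pd

# Hypothetical miniature versions of the games and genres tables
games = pd.DataFrame({"id": [1000001, 1000002, 1000003], "name": ["A", "B", "C"]})
genres = pd.DataFrame({
    "id": [1000001, 1000002, 1000002, 1000009],
    "genre": ["Puzzle", "Puzzle", "Tactical", "Sport"],
})

# A primary key must not contain duplicates
primary_key_is_unique = games["id"].is_unique

# Foreign keys that do not point to any game
orphan_genre_rows = genres[~genres["id"].isin(games["id"])]

# Games that have no genre rows at all
games_without_genre = games[~games["id"].isin(genres["id"])]

print(primary_key_is_unique)
print(orphan_genre_rows)
print(games_without_genre)
```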
Let's Dive deeper into the main dataset that we are currently going to use:
games_df
games_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172512 entries, 0 to 172511
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   id           172512 non-null  int64
 1   name         172512 non-null  object
 2   date         137731 non-null  object
 3   rating       55569 non-null   float64
 4   reviews      172511 non-null  float64
 5   plays        171818 non-null  float64
 6   playing      171818 non-null  float64
 7   backlogs     171818 non-null  float64
 8   wishlists    171818 non-null  float64
 9   description  153588 non-null  object
dtypes: float64(6), int64(1), object(3)
memory usage: 13.2+ MB
games_df.isnull().sum()
id                  0
name                0
date            34781
rating         116943
reviews             1
plays             694
playing           694
backlogs          694
wishlists         694
description     18924
dtype: int64
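Absolute counts are easier to judge as a share of all rows: 116943 missing ratings out of 172512 rows is about 68%. A minimal sketch of that calculation, using a small hypothetical frame `toy` in place of `games_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for games_df
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "rating": [3.5, np.nan, np.nan, np.nan],
    "plays": [10.0, 2.0, np.nan, 0.0],
})

# isna() gives True/False per cell, so the column mean is the missing fraction
missing_share = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_share)
```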
games_df.head()
| | id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | Cathode Ray Tube Amusement Device | 1947-12-31 | 3.5 | 65.0 | 117.0 | 1.0 | 28.0 | 56.0 | The cathode ray tube amusement device is the e... |
| 1 | 1000002 | Bertie the Brain | 1950-08-25 | 2.5 | 11.0 | 24.0 | 0.0 | 6.0 | 12.0 | Currently considered the first videogame in hi... |
| 2 | 1000003 | Nim | 1951-12-31 | 1.8 | 2.0 | 11.0 | 0.0 | 2.0 | 6.0 | The Nimrod was a special purpose computer that... |
| 3 | 1000004 | Draughts | 1952-08-31 | 2.4 | 3.0 | 17.0 | 0.0 | 3.0 | 7.0 | A game of draughts (a.k.a. checkers) written f... |
| 4 | 1000005 | OXO | 1952-12-31 | 3.1 | 14.0 | 52.0 | 1.0 | 12.0 | 13.0 | OXO was a computer game developed by Alexander... |
NB: We do see that a lot of the ratings are missing, but let's continue on, while keeping this in mind.
Also, we can see that there is probably a good correlation between plays, playing, backlogs and wishlists.
While doing so, we can also visualise some other curiosities, like when most of the games in the dataset were made.
# Ensure the 'date' column is in datetime format
#games_df['date'] = pd.to_datetime(games_df['date'])
# Extract the year from the date
#games_df['year'] = games_df['date'].dt.year
# Count the number of games released each year
#games_per_year = games_df['year'].value_counts().sort_index()
# Plot the counts
#plt.figure(figsize=(10, 6))
#games_per_year.plot(kind='bar')
#plt.title('Number of Games Released by Year')
#plt.xlabel('Year')
#plt.ylabel('Number of Games Released')
#plt.show()
After trying to run this code I ran into an error that suggested problematic formatting of the dates: specifically, there was a game registered with the date 6969-06-09, reported at position 12171. We can take a look into that later as well, however I am still interested in a more accurate view of when most games were released.
problematic_row = games_df[games_df['date'] == '6969-06-09']
problematic_row
| | id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description |
|---|---|---|---|---|---|---|---|---|---|---|
| 137716 | 1137717 | The Mysterious Cat Tower | 6969-06-09 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | A turn-based J/W/GRPG. Find spells to add to y... |
After looking up The Mysterious Cat Tower, it is actually a game whose release is planned for that year...
- link to the game https://store.steampowered.com/app/1706960/The_Mysterious_Cat_Tower/
Taking this into consideration, I guess the best thing to do is to remove it, because it will be useful neither for the creation of our system nor for a user who wants to try it.
# Get the index of the problematic row
problematic_index = problematic_row.index
# Drop the problematic row
games_df = games_df.drop(problematic_index)
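The row above was found by searching for the exact string from the error message. A more general way to spot such dates, without knowing them in advance, is to look at the year part of the raw strings. The sketch below is only an illustration on a hypothetical series `dates`; it assumes the raw values are `YYYY-MM-DD` strings and uses 2262 as the limit, because the default nanosecond timestamps in pandas end in that year.

```python
import pandas as pd

# Hypothetical raw date strings, including one far-future value and one missing value
dates = pd.Series(["1947-12-31", "2024-03-31", "6969-06-09", None], name="date")

# pd.Timestamp.max falls in the year 2262, so later years cannot be
# represented with nanosecond resolution
MAX_SUPPORTED_YEAR = 2262

# Take the first four characters as the year; missing values stay NaN
years = pd.to_numeric(dates.str.slice(0, 4), errors="coerce")

out_of_range = dates[years > MAX_SUPPORTED_YEAR]
print(out_of_range)
```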
Now we should be able to see the earliest date given for a game and the latest date on which a game is planned to be released
# Convert 'date' column to datetime, coercing errors to NaT
games_df['date'] = pd.to_datetime(games_df['date'], errors='coerce')
# Get the earliest (min) and latest (max) dates
min_date = games_df['date'].min()
max_date = games_df['date'].max()
print(f"Earliest date: {min_date}")
print(f"Latest date: {max_date}")
# Extract the year from the date
games_df['year'] = games_df['date'].dt.year
# Count the number of games released each year
games_per_year = games_df['year'].value_counts().sort_index()
# Plot the counts
plt.figure(figsize=(12, 6))
games_per_year.plot(kind='bar')
plt.title('Number of Games Released by Year')
plt.xlabel('Year')
plt.ylabel('Number of Games Released')
plt.show()
Earliest date: 1947-12-31 00:00:00
Latest date: 2030-12-20 00:00:00
I am having my doubts currently about the possible usage of the rows where the date is not given, but I also do not know if there is a correlation between the date and something else. For now, let us leave it again like this and start looking into the rest of the features and datasets
# Select only the numerical columns
numerical_games_df = games_df.select_dtypes(include=['int64', 'float64'])
# Compute the correlation matrix
corr_matrix = numerical_games_df.corr()
# Create a correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
As expected, the reviews, plays, playing, backlogs and wishlists features do seem to be quite strongly correlated with each other
# Select the columns to plot
cols = ['reviews', 'plays', 'playing', 'backlogs', 'wishlists', 'year']
subset_df = games_df[cols]
# Create a scatter plot matrix
sns.pairplot(subset_df)
plt.show()
Observation from the scatterplot:¶
We do not see a specific pattern that we can work with, however we will probably get back to these problems soon.
Genres dataset¶
Let us take a look into the genres dataset and see what we can do there
print(genres_df.nunique())
genres_df.head(15)
id       147369
genre        23
dtype: int64
| | id | genre |
|---|---|---|
| 0 | 1000001 | Point-and-Click |
| 1 | 1000002 | Puzzle |
| 2 | 1000002 | Tactical |
| 3 | 1000003 | Pinball |
| 4 | 1000003 | Strategy |
| 5 | 1000004 | Card & Board Game |
| 6 | 1000005 | Puzzle |
| 7 | 1000005 | Strategy |
| 8 | 1000006 | Sport |
| 9 | 1000007 | Arcade |
| 10 | 1000007 | Sport |
| 11 | 1000008 | Simulator |
| 12 | 1000009 | Shooter |
| 13 | 1000009 | Simulator |
| 14 | 1000010 | Strategy |
From what we see it does seem like it would be a wise idea to proceed in this direction:
1. Put all of the genres of a game to be on the same row
2. Encode it in a way so that the genres are looked into as a binary yes/no columns
Putting the genres of a game in the same row:
# Convert the 'genre' column to string
genres_df['genre'] = genres_df['genre'].astype(str)
# Group by 'id' and join the genres into a single string
genres_df = genres_df.groupby('id')['genre'].apply(', '.join).reset_index()
genres_df.head(10)
| | id | genre |
|---|---|---|
| 0 | 1000001 | Point-and-Click |
| 1 | 1000002 | Puzzle, Tactical |
| 2 | 1000003 | Pinball, Strategy |
| 3 | 1000004 | Card & Board Game |
| 4 | 1000005 | Puzzle, Strategy |
| 5 | 1000006 | Sport |
| 6 | 1000007 | Arcade, Sport |
| 7 | 1000008 | Simulator |
| 8 | 1000009 | Shooter, Simulator |
| 9 | 1000010 | Strategy |
Performing one-hot encoding to have the genres "vectorised"
# Perform one-hot encoding
genres_df_encoded = genres_df['genre'].str.get_dummies(sep=', ')
# Join the encoded genres back to the 'id' column
genres_df_encoded = pd.concat([genres_df['id'], genres_df_encoded], axis=1)
genres_df_encoded.head(10)
| | id | Adventure | Arcade | Brawler | Card & Board Game | Fighting | Indie | MOBA | Music | Pinball | ... | RPG | Racing | Real Time Strategy | Shooter | Simulator | Sport | Strategy | Tactical | Turn Based Strategy | Visual Novel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1000002 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1000003 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1000004 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1000005 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 1000006 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 1000007 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | 1000008 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | 1000009 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9 | 1000010 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
10 rows × 24 columns
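The same table can also be built directly from the long format (one row per game and genre), without joining the genres into a string first. This is an alternative sketch, not the approach used above: it relies on `pd.crosstab`, and the small `genres_long` frame is a hypothetical stand-in for the original `genres_df` before grouping.

```python
import pandas as pd

# Hypothetical long-format genres table (one row per game/genre pair)
genres_long = pd.DataFrame({
    "id": [1000002, 1000002, 1000003, 1000003, 1000004],
    "genre": ["Puzzle", "Tactical", "Pinball", "Strategy", "Card & Board Game"],
})

# Count game/genre pairs, then cap at 1 so duplicates still give a yes/no flag
genres_onehot = (
    pd.crosstab(genres_long["id"], genres_long["genre"])
    .clip(upper=1)
    .reset_index()
)
genres_onehot.columns.name = None

print(genres_onehot)
```

One design consideration: this variant never splits on `', '`, so a genre name that happened to contain a comma would not be broken into two columns.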
What would be, let's say, the top 10 most popular genres?
# Convert genre names to lowercase, split on comma, strip whitespaces from each genre name, and count the occurrences
genre_counts = genres_df['genre'].str.lower().str.split(',').apply(lambda x: [i.strip() for i in x]).explode().value_counts()
# Select the top 10 genres
top_10_genres = genre_counts.head(10)
# Print the top 10 genres
print(top_10_genres)
# Create a bar plot of the top 10 genres
plt.figure(figsize=(10, 6))
sns.barplot(x=top_10_genres.index, y=top_10_genres.values)
plt.title('Top 10 Most Popular Genres')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
genre
indie        50501
adventure    49653
simulator    22828
rpg          22320
strategy     21701
shooter      18542
puzzle       17496
arcade       14872
platform     14025
sport        10407
Name: count, dtype: int64
As expected, many of the games are Adventure games, and it also seems that quite a lot of the games are Indie
Now let us combine what we have done until now¶
# Merge the two DataFrames on 'id'
combined_df = pd.merge(games_df, genres_df_encoded, on='id', how='inner')
combined_df_no_encoding = pd.merge(games_df, genres_df, on='id', how='inner')
print(combined_df.shape)
combined_df.head()
(147368, 34)
| | id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description | ... | RPG | Racing | Real Time Strategy | Shooter | Simulator | Sport | Strategy | Tactical | Turn Based Strategy | Visual Novel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | Cathode Ray Tube Amusement Device | 1947-12-31 | 3.5 | 65.0 | 117.0 | 1.0 | 28.0 | 56.0 | The cathode ray tube amusement device is the e... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1000002 | Bertie the Brain | 1950-08-25 | 2.5 | 11.0 | 24.0 | 0.0 | 6.0 | 12.0 | Currently considered the first videogame in hi... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1000003 | Nim | 1951-12-31 | 1.8 | 2.0 | 11.0 | 0.0 | 2.0 | 6.0 | The Nimrod was a special purpose computer that... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1000004 | Draughts | 1952-08-31 | 2.4 | 3.0 | 17.0 | 0.0 | 3.0 | 7.0 | A game of draughts (a.k.a. checkers) written f... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1000005 | OXO | 1952-12-31 | 3.1 | 14.0 | 52.0 | 1.0 | 12.0 | 13.0 | OXO was a computer game developed by Alexander... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
5 rows × 34 columns
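The inner join keeps 147368 rows, so roughly 25,000 games that have no genre rows are silently dropped. To see which rows an inner join would lose, a left join with `indicator=True` can be used. The sketch below uses hypothetical toy frames rather than `games_df` and `genres_df`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
games = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
genres = pd.DataFrame({"id": [1, 2], "genre": ["Puzzle", "Sport"]})

# indicator=True adds a '_merge' column telling where each row came from
checked = pd.merge(games, genres, on="id", how="left", indicator=True)

# Rows that exist only in the left table would disappear in an inner join
dropped_by_inner_join = checked[checked["_merge"] == "left_only"]
print(dropped_by_inner_join[["id", "name"]])
```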
I do, however, wonder whether the genres could be used to estimate the missing years, but I am still not sure if that is possible¶
# Convert 'date' to datetime
combined_df['date'] = pd.to_datetime(combined_df['date'])
# List of genres
genres = ['Adventure', 'Arcade', 'Brawler', 'Card & Board Game', 'Fighting', 'Indie', 'MOBA', 'Music',
          'Pinball', 'Platform', 'Point-and-Click', 'Puzzle', 'Quiz/Trivia', 'RPG', 'Racing',
          'Real Time Strategy', 'Shooter', 'Simulator', 'Sport', 'Strategy', 'Tactical',
          'Turn Based Strategy', 'Visual Novel']
# Resample the data by month and count the number of games for each genre
# ('ME' = month end; it replaces the 'M' alias, which is deprecated since pandas 2.2)
monthly_counts = combined_df.resample('ME', on='date')[genres].sum()
# Create a line plot for each genre
plt.figure(figsize=(24, 8))
for genre in genres:
    plt.plot(monthly_counts.index, monthly_counts[genre], label=genre)
plt.title('Number of Games Over Time by Genre')
plt.xlabel('Date')
plt.ylabel('Count')
plt.legend(loc='upper left', bbox_to_anchor=(1,1))
plt.show()
# Create a stacked area plot for each genre
plt.figure(figsize=(18, 6))
plt.stackplot(monthly_counts.index, monthly_counts[genres].T)
plt.title('Number of Games Over Time by Genre')
plt.xlabel('Date')
plt.ylabel('Count')
plt.legend(genres, loc='upper left', bbox_to_anchor=(1,1))
plt.show()
# Create a scatter plot for each genre
fig, axs = plt.subplots(len(genres), figsize=(10, 6*len(genres)))
for i, genre in enumerate(genres):
    axs[i].scatter(monthly_counts.index, monthly_counts[genre])
    axs[i].set_title('Number of ' + genre + ' Games Over Time')
    axs[i].set_xlabel('Date')
    axs[i].set_ylabel('Count')
plt.tight_layout()
plt.show()
# Reshape the DataFrame
reshaped_df = monthly_counts.reset_index().melt(id_vars='date', var_name='genre', value_name='count')
# Create a FacetGrid
g = sns.FacetGrid(reshaped_df, col='genre', col_wrap=5, height=4)
g = g.map(plt.plot, 'date', 'count')
plt.show()
It does not really look possible, so I will leave it like that for now. Let us continue looking into the missing values¶
games_df.isna().sum()
id                  0
name                0
date            34781
rating         116942
reviews             1
plays             694
playing           694
backlogs          694
wishlists         694
description     18924
year            34781
dtype: int64
# Convert 'date' to datetime if it's not already
games_df['date'] = pd.to_datetime(games_df['date'])
# Extract the year from the date
games_df['release_year'] = games_df['date'].dt.year
# Calculate the average rating per year
average_rating_per_year = games_df.groupby('release_year')['rating'].mean()
# Print the average rating per year
print(average_rating_per_year)
release_year
1947.0 3.500000
1950.0 2.500000
1951.0 1.800000
1952.0 2.750000
1954.0 3.000000
...
2025.0 1.966667
2026.0 NaN
2027.0 NaN
2029.0 NaN
2030.0 NaN
Name: rating, Length: 72, dtype: float64
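A yearly mean on its own hides how many ratings it is based on; the early years in the output above, for example, may rest on a single game. Aggregating the mean together with the count makes that visible. A minimal sketch on a hypothetical frame `toy` with the same two columns:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the release_year and rating columns of games_df
toy = pd.DataFrame({
    "release_year": [2023, 2023, 2023, 2025, 2025, 2026],
    "rating": [3.0, 4.0, np.nan, 2.0, np.nan, np.nan],
})

# 'count' only counts non-missing ratings, so it shows how reliable each mean is
rating_summary = toy.groupby("release_year")["rating"].agg(["mean", "count"])
print(rating_summary)
```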
It is understandable that games which will be released in the future still do not have a rating. The cleaning will continue. Now we will turn our attention to reviews
# Find rows where 'reviews' is NaN
empty_reviews_rows = games_df[games_df['reviews'].isna()]
# Print the empty reviews rows
empty_reviews_rows
| | id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description | year | release_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 136540 | 1136541 | Enigma of Fear | 2024-03-31 | 2.3 | NaN | 21.0 | 0.0 | 42.0 | 230.0 | Become Mia, a paranormal detective searching f... | 2024.0 | 2024.0 |
This is understandable again, however I am interested to see what the case is with games that are announced for after 2024
# Create a subset of games released after 2024
games_after_2024 = games_df[games_df['release_year'] > 2024]
# Show a random sample of 15 rows from the subset
games_after_2024.sample(15)
| | id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description | year | release_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 137685 | 1137686 | Tales of the Death | 2025-12-31 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | Join a comic adventure in a hand-drawn book wi... | 2025.0 | 2025.0 |
| 137689 | 1137690 | DecaPolice | 2025-12-31 | NaN | 0.0 | 5.0 | 1.0 | 127.0 | 575.0 | DecaPolice, a crime-suspense RPG from Level-5,... | 2025.0 | 2025.0 |
| 137709 | 1137710 | Margaritari | 2026-01-01 | NaN | 0.0 | 0.0 | 0.0 | 4.0 | 12.0 | "Margaritari", a JRPG-style narrative game. Co... | 2026.0 | 2026.0 |
| 137674 | 1137675 | Grand Theft Auto VI | 2025-12-31 | NaN | 16.0 | 83.0 | 6.0 | 557.0 | 2452.0 | Grand Theft Auto VI heads to the state of Leon... | 2025.0 | 2025.0 |
| 137715 | 1137716 | PilotXross | 2030-12-20 | NaN | 0.0 | 2.0 | 0.0 | 2.0 | 3.0 | VR flight game developed for VR devices.The pl... | 2030.0 | 2030.0 |
| 137701 | 1137702 | Eternity Guards | 2025-12-31 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | The main idea of the game is a fantasy alterna... | 2025.0 | 2025.0 |
| 137680 | 1137681 | Mouse | 2025-12-31 | NaN | 1.0 | 0.0 | 0.0 | 87.0 | 371.0 | Join private detective John Mouston in MOUSE, ... | 2025.0 | 2025.0 |
| 137683 | 1137684 | Rise of Rebellion | 2025-12-31 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | "Rise of Rebellion" is a 3D action RPG that pu... | 2025.0 | 2025.0 |
| 137684 | 1137685 | Twistales | 2025-12-31 | NaN | 0.0 | 0.0 | 0.0 | 1.0 | 11.0 | On the day of what she believed to be her ‘Hap... | 2025.0 | 2025.0 |
| 137669 | 1137670 | Monster Hunter Wilds | 2025-12-31 | NaN | 3.0 | 5.0 | 1.0 | 155.0 | 646.0 | Monster Hunter Wilds. The next generation in t... | 2025.0 | 2025.0 |
| 137668 | 1137669 | Silent Planet | 2025-12-31 | NaN | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | Silent Planet is an exploration-focused, dark ... | 2025.0 | 2025.0 |
| 137670 | 1137671 | Kipidon | 2025-12-31 | NaN | 0.0 | 1.0 | 0.0 | 1.0 | 5.0 | A colorful shooter that uses a flower as a cup... | 2025.0 | 2025.0 |
| 137655 | 1137656 | Dungellion | 2025-04-01 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Rogue-lite with elements of Battle Royale and ... | 2025.0 | 2025.0 |
| 137677 | 1137678 | Riversiders | 2025-12-31 | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Raft down the river, camp in the wild, and mak... | 2025.0 | 2025.0 |
| 137699 | 1137700 | Big Boss: A Villain Simulator | 2025-12-31 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | An asymmetrical roguelite experience where the... | 2025.0 | 2025.0 |
After looking more into it and into what is going on on the site, the dataset does show reviews for some games even though they have not been released yet. In that case it seems alright to substitute the reviews of games dated after 2024 with 0
Another example of what I saw:
However on the site, we are not shown any reviews - https://www.backloggd.com/games/2xko/
This is a problem that affects games that have not been released yet, so I do believe that it will be alright if we set the reviews to 0 for the games dated after 2024
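The substitution described above could be sketched as follows. This is only an illustration on a hypothetical frame `toy`, not the cleaning code of the notebook, and it assumes that the `date` column has already been converted to datetime.

```python
import pandas as pd

# Hypothetical stand-in for games_df with a released, an upcoming and an undated game
toy = pd.DataFrame({
    "name": ["Released game", "Upcoming game", "Undated game"],
    "date": pd.to_datetime(["2021-08-27", "2025-12-31", None]),
    "reviews": [440.0, 16.0, 3.0],
})

# Games dated after 2024; rows without a date compare as False and stay untouched
upcoming = toy["date"].dt.year > 2024

# Overwrite the review count only for the upcoming games
toy.loc[upcoming, "reviews"] = 0.0
print(toy)
```

Note that a `> 2024` mask does not cover the single missing value seen earlier (Enigma of Fear, dated 2024-03-31), so that row would need separate handling.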
games_df.sample(15)
| | id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description | year | release_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6130 | 1006131 | Rambo | 1988-12-04 | 1.9 | 6.0 | 96.0 | 0.0 | 20.0 | 7.0 | Rambo is a side scrolling platform game where ... | 1988.0 | 1988.0 |
| 94916 | 1094917 | MMX: Otherworld Mystery - Expanded Edition | 2019-10-17 | NaN | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | The mystery of the Otherworldly World MMX will... | 2019.0 | 2019.0 |
| 112163 | 1112164 | No More Heroes III | 2021-08-27 | 3.9 | 440.0 | 2569.0 | 140.0 | 2435.0 | 1857.0 | The latest numbered entry in the No More Heroe... | 2021.0 | 2021.0 |
| 8571 | 1008572 | King's Bounty | 1990-12-31 | 3.0 | 1.0 | 32.0 | 0.0 | 23.0 | 5.0 | While King Maximus ruled the land, life was go... | 1990.0 | 1990.0 |
| 145671 | 1145672 | WSOP | NaT | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Show off your Texas Hold'em Poker skills! The... | NaN | NaN |
| 54334 | 1054335 | Cities in Motion 2 | 2013-04-02 | 2.3 | 2.0 | 67.0 | 1.0 | 54.0 | 3.0 | Cities in Motion 2 (CIM2) is the sequel to the... | 2013.0 | 2013.0 |
| 74582 | 1074583 | Vikings: Wolves of Midgard | 2017-03-24 | 2.5 | 4.0 | 106.0 | 4.0 | 102.0 | 10.0 | Vikings: Wolves of Midgard takes you to the Sh... | 2017.0 | 2017.0 |
| 44345 | 1044346 | Hikari no Valusia ~What a Beautiful Hopes~ | 2009-11-20 | NaN | 0.0 | 2.0 | 0.0 | 6.0 | 8.0 | In a big city in the desert called Valcia, onl... | 2009.0 | 2009.0 |
| 155460 | 1155461 | Kanojo wa Dare to demo Sex suru. | NaT | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | A visual novel from Orcsoft Team Goblin. | NaN | NaN |
| 133864 | 1133865 | Inescapable: No Rules, No Rescue | 2023-10-19 | NaN | 1.0 | 5.0 | 0.0 | 8.0 | 17.0 | Inescapable is a social thriller set in a trop... | 2023.0 | 2023.0 |
| 150024 | 1150025 | Salvage | NaT | NaN | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | Turn-based bullet-hell class warfare | NaN | NaN |
| 56923 | 1056924 | Detective Grimoire: Secret of the Swamp | 2014-01-02 | 3.3 | 28.0 | 296.0 | 4.0 | 99.0 | 42.0 | Solve puzzles, collect clues, explore the swam... | 2014.0 | 2014.0 |
| 126182 | 1126183 | Hitman 3: Dubai | 2023-01-20 | 3.8 | 0.0 | 10.0 | 0.0 | 2.0 | 0.0 | Experience the grandeur and decadence of Dubai... | 2023.0 | 2023.0 |
| 121792 | 1121793 | Madden NFL 23 | 2022-08-19 | 2.3 | 45.0 | 252.0 | 18.0 | 21.0 | 11.0 | Play your way into the history books! Updates ... | 2022.0 | 2022.0 |
| 107040 | 1107041 | The Yellow Rose Motel | 2021-02-22 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | The yellow rose motel is a ps1/VHS style low p... | 2021.0 | 2021.0 |
# Drop 'year' and 'release_year' columns
games_df = games_df.drop(['year', 'release_year'], axis=1)
games_df.isna().sum()
id 0
name 0
date 34781
rating 116942
reviews 1
plays 694
playing 694
backlogs 694
wishlists 694
description 18924
dtype: int64
# Find rows where 'reviews' is NaN
empty_reviews_rows = games_df[games_df['reviews'].isna()]
# Print the empty reviews rows
empty_reviews_rows
| id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 136540 | 1136541 | Enigma of Fear | 2024-03-31 | 2.3 | NaN | 21.0 | 0.0 | 42.0 | 230.0 | Become Mia, a paranormal detective searching f... |
games_df['reviews'] = games_df['reviews'].fillna(0)
print(games_df)
id name date \
0 1000001 Cathode Ray Tube Amusement Device 1947-12-31
1 1000002 Bertie the Brain 1950-08-25
2 1000003 Nim 1951-12-31
3 1000004 Draughts 1952-08-31
4 1000005 OXO 1952-12-31
... ... ... ...
172507 1172508 Super Robot Wars 30: Digital Deluxe Edition NaT
172508 1172509 Xotic: Temple Crypt Expansion Pack NaT
172509 1172510 Dust Raiders NaT
172510 1172511 EXE Clash NaT
172511 1172512 Dance Killer Trick!!!: Boys, Be Dancing NaT
rating reviews plays playing backlogs wishlists \
0 3.5 65.0 117.0 1.0 28.0 56.0
1 2.5 11.0 24.0 0.0 6.0 12.0
2 1.8 2.0 11.0 0.0 2.0 6.0
3 2.4 3.0 17.0 0.0 3.0 7.0
4 3.1 14.0 52.0 1.0 12.0 13.0
... ... ... ... ... ... ...
172507 NaN 0.0 0.0 0.0 0.0 0.0
172508 NaN 0.0 1.0 0.0 0.0 0.0
172509 NaN 0.0 0.0 0.0 0.0 2.0
172510 NaN 0.0 0.0 0.0 0.0 1.0
172511 NaN 0.0 2.0 0.0 0.0 1.0
description
0 The cathode ray tube amusement device is the e...
1 Currently considered the first videogame in hi...
2 The Nimrod was a special purpose computer that...
3 A game of draughts (a.k.a. checkers) written f...
4 OXO was a computer game developed by Alexander...
... ...
172507 NaN
172508 Explore the mystical crypts and forgotten pass...
172509 Dust Raiders is a management strategy game, se...
172510 A platform fighting game featuring many spooky...
172511 Dance Killer Trick!!!: Boys, Be Dancing is an ...
[172511 rows x 10 columns]
games_df.isna().sum()
id 0
name 0
date 34781
rating 116942
reviews 0
plays 694
playing 694
backlogs 694
wishlists 694
description 18924
dtype: int64
Scatterplot with the genres - not encoded¶
genres_df.head()
| id | genre | |
|---|---|---|
| 0 | 1000001 | Point-and-Click |
| 1 | 1000002 | Puzzle, Tactical |
| 2 | 1000003 | Pinball, Strategy |
| 3 | 1000004 | Card & Board Game |
| 4 | 1000005 | Puzzle, Strategy |
# Select the columns to plot
cols1 = ['reviews', 'plays', 'playing', 'backlogs', 'wishlists', 'year']
subset_df = combined_df_no_encoding[cols1]
# Create a scatter plot matrix
sns.pairplot(subset_df)
plt.show()
print(games_df.isna().sum())
games_df.sample(15)
id 0
name 0
date 34781
rating 116942
reviews 0
plays 694
playing 694
backlogs 694
wishlists 694
description 18924
dtype: int64
| id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 89230 | 1089231 | VMod | 2019-01-10 | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | VMod is a puzzle game about side-effects. Tap ... |
| 73562 | 1073563 | Elaine | 2017-02-14 | NaN | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | This is elaine! A minimalistic tile-matching a... |
| 61256 | 1061257 | GrindCraft | 2015-01-30 | 2.5 | 3.0 | 10.0 | 0.0 | 0.0 | 0.0 | GrindCraft is a Minecraft-themed clicker game ... |
| 115156 | 1115157 | PsiloSybil | 2021-12-07 | 3.2 | 5.0 | 34.0 | 3.0 | 48.0 | 98.0 | An old-school, tough-as nails classic linear 3... |
| 114581 | 1114582 | Door 3: Insignia | 2021-11-15 | NaN | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | Symbols have never had such an impact on a per... |
| 119409 | 1119410 | Under the Jolly Roger: Complete Edition | 2022-05-23 | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | The Complete Edition includes: - Under the Jol... |
| 88024 | 1088025 | Daze | 2018-11-27 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | The world you’re about to experience Crafted b... |
| 90838 | 1090839 | City of Heroes: Homecoming | 2019-04-02 | NaN | 0.0 | 3.0 | 4.0 | 0.0 | 0.0 | City of Heroes: Homecoming is an officially li... |
| 129848 | 1129849 | PonyGuessr | 2023-05-17 | NaN | 1.0 | 3.0 | 0.0 | 0.0 | 1.0 | Guess the episode each My Little Pony: Friends... |
| 21081 | 1021082 | Red Baron II | 1998-10-30 | NaN | 0.0 | 5.0 | 0.0 | 3.0 | 0.0 | Red Baron II by Dynamix, Inc is the sequel to ... |
| 46288 | 1046289 | Leo & Leah | 2010-07-22 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Rather simple. Leo and Leah have been close si... |
| 23580 | 1023581 | NASCAR Challenge | 2000-01-01 | 2.8 | 1.0 | 5.0 | 0.0 | 2.0 | 0.0 | The thrill of victory and the agony of smashin... |
| 162822 | 1162823 | Shadows of the Damned | NaT | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Well after its 2011 Xbox 360 and PlayStation 3... |
| 36824 | 1036825 | The Mysterious Mine Bouncin' Back Edition | 2006-12-21 | NaN | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | This is the author’s Earthbound hack that they... |
| 46254 | 1046255 | duplicate Hakuoki Junsouroku | 2010-07-17 | NaN | 0.0 | 2.0 | 0.0 | 2.0 | 0.0 | Released in Japan in 2010. |
# Select rows where 'plays' is NaN
nan_plays = games_df[games_df['plays'].isna()]
# Print the selected rows
nan_plays.sample(15)
| id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 6591 | 1006592 | Circus Games | 1988-12-31 | NaN | 0.0 | NaN | NaN | NaN | NaN | Ladiiiieeeees and gentlemeeeeennnnn! Children ... |
| 74575 | 1074576 | The Cat Games | 2017-03-24 | 3.1 | 5.0 | NaN | NaN | NaN | NaN | Do you like cats? Then this is the purrfect ga... |
| 147678 | 1147679 | Play the Games Vol. 5 | NaT | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 45924 | 1045925 | Family Gameshow | 2010-06-04 | NaN | 0.0 | NaN | NaN | NaN | NaN | Good evening and welcome to Family Gameshow! T... |
| 132066 | 1132067 | Food Fighter Clicker Games | 2023-08-22 | NaN | 0.0 | NaN | NaN | NaN | NaN | Food Fighter Clicker is a clicker game to beco... |
| 34259 | 1034260 | 2 Games in 1: Sonic Pinball Party + Sonic Battle | 2005-11-11 | NaN | 0.0 | NaN | NaN | NaN | NaN | Bundle containing Sonic Pinball Party and Soni... |
| 24633 | 1024634 | Barbie: Fashion Pack Games | 2000-10-01 | NaN | 0.0 | NaN | NaN | NaN | NaN | Discover a world of fashion fun with Barbie! P... |
| 145376 | 1145377 | Game Chest: Board Games | NaT | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 11504 | 1011505 | California Games II | 1992-12-31 | NaN | 0.0 | NaN | NaN | NaN | NaN | Amiga port of California Games II. |
| 79422 | 1079423 | Red | 2017-11-01 | NaN | 0.0 | NaN | NaN | NaN | NaN | Try your best in the hardest game ever release... |
| 159061 | 1159062 | Pack 2 Games Pony Friends 2 + My Riding Stable... | NaT | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 11783 | 1011784 | Summer Games | 1992-12-31 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 11686 | 1011687 | California Games II | 1992-12-31 | NaN | 0.0 | NaN | NaN | NaN | NaN | Atari ST port of California Games II. |
| 102175 | 1102176 | The Dreamcatcher | 2020-08-21 | NaN | 1.0 | NaN | NaN | NaN | NaN | There is a saying that "what you think about i... |
| 58506 | 1058507 | Escape from Jay Is Games | 2014-06-06 | NaN | 0.0 | NaN | NaN | NaN | NaN | It's a room escape point and click puzzle game... |
After having a look at the samples of games that have no plays, it seems that missing plays go hand in hand with missing playing, backlogs and wishlists. There are quite a few rows where we have neither the release date nor any of playing, backlogs, wishlists and plays, and nothing for the description either.
Examples:
I looked them up on the site hoping that in some cases we might be able to find something, but it does not look like it:
https://www.backloggd.com/games/play-the-games-vol-5/
After finding this, I believe it is fine to drop the rows that have neither a date nor anything else. They are pretty much empty rows with nothing but a title and an id, and there is no reason for a recommendation system to suggest entries that give nothing of actual value to the user.
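The observation that plays, playing, backlogs and wishlists are missing together can be checked directly. The snippet below is only a sketch on a made-up frame (`toy_games` is hypothetical); on the real `games_df` the same two counts would be compared.

```python
import numpy as np
import pandas as pd

# Toy stand-in for games_df (values are illustrative only)
toy_games = pd.DataFrame({
    "name": ["A", "B", "C"],
    "plays": [10.0, np.nan, np.nan],
    "playing": [1.0, np.nan, np.nan],
    "backlogs": [2.0, np.nan, np.nan],
    "wishlists": [3.0, np.nan, np.nan],
})

engagement_cols = ["plays", "playing", "backlogs", "wishlists"]

# True for rows where every engagement column is missing
all_missing = toy_games[engagement_cols].isna().all(axis=1)
# True for rows where at least one engagement column is missing
any_missing = toy_games[engagement_cols].isna().any(axis=1)

# If the two counts match, the four columns are always missing together
print(all_missing.sum(), any_missing.sum())
```

If the two numbers differ, some rows are only partially empty and would deserve a closer look before dropping them.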
print(nan_plays.shape)
nan_plays.isna().sum()
(694, 10)
id 0
name 0
date 158
rating 530
reviews 0
plays 694
playing 694
backlogs 694
wishlists 694
description 172
dtype: int64
# Reassign instead of using inplace=True on a slice of games_df,
# which is what triggers pandas' SettingWithCopyWarning
nan_plays = nan_plays.dropna(subset=['date', 'plays'], how='all')
nan_plays.isna().sum()
id 0
name 0
date 0
rating 386
reviews 0
plays 536
playing 536
backlogs 536
wishlists 536
description 67
dtype: int64
nan_plays.sample(15)
| id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 6855 | 1006856 | California Games | 1989-05-01 | 2.8 | 0.0 | NaN | NaN | NaN | NaN | Master System port of California Games. |
| 4617 | 1004618 | Future Games | 1986-12-31 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 103258 | 1103259 | Encore Classic Casino Games | 2020-10-05 | NaN | 0.0 | NaN | NaN | NaN | NaN | You've hit the jackpot with the most comprehen... |
| 51498 | 1051499 | 50 Classic Games 3D | 2012-04-24 | NaN | 1.0 | NaN | NaN | NaN | NaN | Now your favorite classic pastimes are bundled... |
| 107691 | 1107692 | Bocchi Kaihi | 2021-03-17 | NaN | 0.0 | NaN | NaN | NaN | NaN | Rescue the mysterious main character with a sh... |
| 42138 | 1042139 | Best of Arcade Games DS | 2009-01-28 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 15177 | 1015178 | Family Games | 1995-02-01 | NaN | 0.0 | NaN | NaN | NaN | NaN | Here's a game collection for the whole family.... |
| 4403 | 1004404 | Future Games | 1986-12-31 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 17841 | 1017842 | 3DO Games: Decathlon | 1996-10-31 | NaN | 0.0 | NaN | NaN | NaN | NaN | As you would expect, all ten events are repres... |
| 27691 | 1027692 | Love Game's Wai Wai Tennis Plus | 2002-04-28 | NaN | 0.0 | NaN | NaN | NaN | NaN | Tune up for "Wai Wai Tennis", the authentic po... |
| 34217 | 1034218 | Clubhouse Games | 2005-11-03 | 3.5 | 28.0 | NaN | NaN | NaN | NaN | It's game night and everyone's invited. Play m... |
| 3594 | 1003595 | Rainy Day Games | 1985-12-31 | NaN | 0.0 | NaN | NaN | NaN | NaN | Rainy Day Games is 1 - 3 player series of 3 ca... |
| 42283 | 1042284 | Clubhouse Games Express: Strategy Pack | 2009-02-25 | NaN | 0.0 | NaN | NaN | NaN | NaN | A DSiWare game based on the original Clubhouse... |
| 101651 | 1101652 | Rage of Car Force: Car Crashing Games | 2020-07-30 | NaN | 0.0 | NaN | NaN | NaN | NaN | Rage of Car Force is a team multiplayer PvP ga... |
| 18670 | 1018671 | Love Game's Wai Wai Tennis | 1997-02-28 | NaN | 0.0 | NaN | NaN | NaN | NaN | NaN |
After looking at more examples of games that have no values in the columns we are currently interested in, it seems to come down to one of two things:
1. It is some sort of a bundle of games that we do not have any details about
2. The games actually do not have much data, so they are pretty much 'left out' and have not been interacted with
Another example:
After considering all of this, I believe it is fine to drop all of the rows that have missing values in plays, playing, backlogs and wishlists.
games_df.sample(15)
| id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 169786 | 1169787 | Super Robot Taisen Compact 3 | NaT | NaN | 0.0 | 3.0 | 0.0 | 6.0 | 4.0 | Super Robot Taisen Compact 3 is the final Wond... |
| 100074 | 1100075 | Sonic 3D in 2D | 2020-05-20 | 2.5 | 7.0 | 27.0 | 5.0 | 14.0 | 6.0 | Sonic 3D in 2D is a fangame that reimagines So... |
| 110701 | 1110702 | Labyrinths of the World: The Game of Minds - C... | 2021-07-02 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | Desperate Times Call for Desperate Measures... |
| 144580 | 1144581 | Rivals at War: Firefight | NaT | NaN | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | Rivals at War: Firefight is a military themed,... |
| 72497 | 1072498 | Halodoom: Code of Silence | 2016-12-31 | NaN | 0.0 | 3.0 | 0.0 | 6.0 | 4.0 | Halodoom is a project about the past and the p... |
| 69030 | 1069031 | Robot Legions Reborn | 2016-07-19 | NaN | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | On a planet overrun by robots, one rogue unit ... |
| 64785 | 1064786 | Sword Coast Legends | 2015-10-20 | 1.7 | 4.0 | 33.0 | 1.0 | 22.0 | 3.0 | Set in the lush and vibrant world of the Forgo... |
| 48990 | 1048991 | Wii Play: Motion | 2011-06-13 | 3.1 | 29.0 | 418.0 | 3.0 | 59.0 | 49.0 | Wii Play: Motion is a minigame collection that... |
| 44854 | 1044855 | Situation Outbreak | 2009-12-31 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Situation Outbreak is an Orange Box mod about ... |
| 130056 | 1130057 | Sunshine Shuffle | 2023-05-24 | 3.0 | 14.0 | 32.0 | 0.0 | 14.0 | 28.0 | Sunshine Shuffle is a narrative poker adventur... |
| 4970 | 1004971 | Hysteria | 1987-08-01 | NaN | 0.0 | 2.0 | 0.0 | 3.0 | 0.0 | A fanatical sect has changed mankind's future ... |
| 40905 | 1040906 | Family Trainer | 2008-09-26 | 2.8 | 2.0 | 19.0 | 0.0 | 2.0 | 0.0 | An outdoor sports themed mini-game collection ... |
| 104103 | 1104104 | Deviant Anomalies | 2020-10-31 | NaN | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | Deviant Anomalies is an adult visual novel and... |
| 127095 | 1127096 | Ages of Conflict: World War Simulator | 2023-02-17 | 2.6 | 1.0 | 18.0 | 0.0 | 4.0 | 1.0 | Ages of Conflict is a versatile Map Simulation... |
| 44936 | 1044937 | Xuan-Yuan Sword: The Clouds Faraway | 2010-01-12 | NaN | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 軒轅劍外傳: 雲之遙 (Chinese) The Clouds Faraway, also... |
games_df.isna().sum()
id 0
name 0
date 34781
rating 116942
reviews 0
plays 694
playing 694
backlogs 694
wishlists 694
description 18924
dtype: int64
# Drop rows where 'plays', 'playing', 'backlogs', or 'wishlists' are NaN
games_df = games_df.dropna(subset=['plays', 'playing', 'backlogs', 'wishlists'])
# Print the DataFrame to verify the changes
games_df.sample(15)
| id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 72032 | 1072033 | Defense Zone 3 Ultra HD | 2016-12-14 | NaN | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | The sequel to the hit strategy game, with new ... |
| 138896 | 1138897 | Get Backers Dakkanoku: Ubawareta Mugen Shiro | NaT | NaN | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | NaN |
| 118864 | 1118865 | SuperDungeon MegaCorp | 2022-05-02 | NaN | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | Welcome to SuperDungeon MegaCorp, a puzzle gam... |
| 41982 | 1041983 | Gauntlet | 2008-12-31 | NaN | 2.0 | 4.0 | 0.0 | 4.0 | 1.0 | Gauntlet DS is the cancelled chapter of the po... |
| 41062 | 1041063 | Crazy Mouse | 2008-10-15 | NaN | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN |
| 62826 | 1062827 | My Bewitching Perfume | 2015-06-01 | NaN | 0.0 | 1.0 | 0.0 | 6.0 | 4.0 | An exciting love simulation game for girls! It... |
| 14574 | 1014575 | FIFA International Soccer | 1994-12-31 | NaN | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 8-bit port of FIFA International Soccer. |
| 14906 | 1014907 | Seal of the Pharaoh | 1994-12-31 | NaN | 0.0 | 1.0 | 0.0 | 11.0 | 4.0 | Seal of the Pharaoh is a mix of RPG, puzzle-so... |
| 34630 | 1034631 | AFL Premiership 2005 | 2005-12-31 | NaN | 1.0 | 7.0 | 0.0 | 0.0 | 2.0 | AFL Premiership 2005 is based off the Australi... |
| 110707 | 1110708 | MontanaBlack Kylo's Rescue | 2021-07-02 | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | "You're kidnapping Kylo just because I don't w... |
| 117577 | 1117578 | WWE 2K22 | 2022-03-11 | 3.3 | 84.0 | 875.0 | 52.0 | 65.0 | 48.0 | WWE 2K returns with all the features you can h... |
| 42735 | 1042736 | Burnout Paradise: Cops and Robbers | 2009-04-30 | 3.3 | 1.0 | 8.0 | 0.0 | 1.0 | 3.0 | The Cops and Robbers Pack is a downloadable pr... |
| 87877 | 1087878 | Fruit Salad | 2018-11-18 | NaN | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | Complevel 9. Maps 01-06. The palette will chan... |
| 24626 | 1024627 | EZ2Dancer | 2000-09-30 | NaN | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | EZ2Dancer is a series of dance video games dev... |
| 25171 | 1025172 | Beatmania III: Append Core Remix | 2000-12-31 | 4.3 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | Beatmania III: Append Core Remix is a rhythm g... |
games_df.isna().sum()
id 0
name 0
date 34623
rating 116412
reviews 0
plays 0
playing 0
backlogs 0
wishlists 0
description 18752
dtype: int64
scores_df.head(15)
| id | score | amount | |
|---|---|---|---|
| 0 | 1000001 | 0.5 | 10 |
| 1 | 1000001 | 1.0 | 5 |
| 2 | 1000001 | 1.5 | 1 |
| 3 | 1000001 | 2.0 | 3 |
| 4 | 1000001 | 2.5 | 9 |
| 5 | 1000001 | 3.0 | 10 |
| 6 | 1000001 | 3.5 | 2 |
| 7 | 1000001 | 4.0 | 2 |
| 8 | 1000001 | 4.5 | 3 |
| 9 | 1000001 | 5.0 | 41 |
| 10 | 1000002 | 0.5 | 0 |
| 11 | 1000002 | 1.0 | 3 |
| 12 | 1000002 | 1.5 | 0 |
| 13 | 1000002 | 2.0 | 4 |
| 14 | 1000002 | 2.5 | 2 |
Let us use a mathematical way of filling in the gaps of the ratings column, with the help of the scores dataset¶
Things that should be considered: while looking into the ratings I found out that a game might have a NaN rating simply because it was never given one. However, it is also possible that the game was given a rating on the site, but because it is only a single "floating" rating it was not used as the average rating (which is understandable: an average of one rating is hardly a good all-round rating for a game).
Example of what I explained above ↑
1. We will look into games_df and see which rows have NaN "rating"
2. Find the scores given to the game with the corresponding id in scores_df
3. Calculate the average rating of the game with that id
4. Fill in the NaN value in games_df with the average rating that was just calculated
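As a small worked example of step 3, here is the weighted average for game 1000001, using the score distribution shown in `scores_df.head(15)`. It is only a sketch in plain Python and the variable names are mine. The result matches the 3.5 rating that games_df already lists for this game, which suggests the site's rating is this kind of weighted average.

```python
# Score distribution of game 1000001, copied from scores_df.head(15): (score, amount)
distribution = [
    (0.5, 10), (1.0, 5), (1.5, 1), (2.0, 3), (2.5, 9),
    (3.0, 10), (3.5, 2), (4.0, 2), (4.5, 3), (5.0, 41),
]

total_score = sum(score * amount for score, amount in distribution)  # 303.5
total_votes = sum(amount for _, amount in distribution)              # 86
weighted_average = round(total_score / total_votes, 1)

print(total_score, total_votes, weighted_average)
```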
# Identify the rows in games_df where 'rating' is NaN
nan_rating_ids = games_df.loc[games_df['rating'].isna(), 'id']
# For each of these rows, find the corresponding rows in scores_df
# (.copy() avoids the SettingWithCopyWarning when a column is added below)
scores_df_filtered = scores_df[scores_df['id'].isin(nan_rating_ids)].copy()
# Calculate the weighted average rating for each game
scores_df_filtered['total_score'] = scores_df_filtered['score'] * scores_df_filtered['amount']
grouped_scores = scores_df_filtered.groupby('id').agg({'total_score': 'sum', 'amount': 'sum'}).reset_index()
grouped_scores['average_rating'] = (grouped_scores['total_score'] / grouped_scores['amount']).round(1)
# Replace the NaN values in 'rating' in games_df with the calculated average rating
games_df = games_df.set_index('id')
grouped_scores = grouped_scores.set_index('id')
# Assign the result back instead of calling fillna(..., inplace=True) on a column,
# which pandas warns will stop working in pandas 3.0
games_df['rating'] = games_df['rating'].fillna(grouped_scores['average_rating'])
games_df = games_df.reset_index()
print(games_df.isna().sum())
games_df.sample(15)
id 0
name 0
date 34623
rating 79226
reviews 0
plays 0
playing 0
backlogs 0
wishlists 0
description 18752
dtype: int64
| id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description | |
|---|---|---|---|---|---|---|---|---|---|---|
| 132186 | 1132712 | Yummy Jewels | 2023-09-14 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Go on a candy puzzle adventure. |
| 64917 | 1065262 | Demented | 2015-11-18 | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Enter the crooked world of Demented, where eve... |
| 10875 | 1010941 | Jennifer Capriati Tennis | 1992-09-16 | 3.5 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | This is a tennis game featuring the game modes... |
| 22233 | 1022356 | Shiritsu Justice Gakuen: Nekketsu Seishun Nikki 2 | 1999-06-24 | 4.0 | 3.0 | 33.0 | 0.0 | 9.0 | 10.0 | This is anPlayStation-exclusive update to the ... |
| 140350 | 1140898 | Sonic Classics: 3-in-1 | NaT | 2.7 | 0.0 | 7.0 | 1.0 | 2.0 | 1.0 | A compilation cartridge containing 3 Sonic the... |
| 72328 | 1072696 | Variant: Limits | 2017-01-09 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Variant: Limits connects mathematics and game ... |
| 168951 | 1169638 | BoBoiBoy: Adudu Attacks! | NaT | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 92181 | 1092603 | O2Jam | 2019-06-30 | NaN | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Enjoy the classic rhythm game for everyone! En... |
| 136346 | 1136883 | Terminator: Survivors | 2024-10-24 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | Play as a survivor in an open world set after ... |
| 59861 | 1060183 | LEGO Batman Trilogy | 2014-11-11 | 3.8 | 5.0 | 170.0 | 1.0 | 35.0 | 6.0 | NaN |
| 86758 | 1087166 | Fly High | 2018-10-19 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Fly High is action-adventure game with fantasy... |
| 37130 | 1037336 | Cradle of Rome | 2007-02-27 | 3.0 | 1.0 | 14.0 | 0.0 | 0.0 | 1.0 | Build the heart of the Ancient Roman Empire an... |
| 7197 | 1007251 | Clown-o-Mania | 1989-12-31 | NaN | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | An obscure arcade-style game for the Amiga and... |
| 79799 | 1080185 | Cleaning house | 2017-12-04 | 2.9 | 0.0 | 6.0 | 1.0 | 2.0 | 2.0 | a little game about a house, and what is left ... |
| 66310 | 1066660 | The Gate of Firmament | 2016-02-25 | NaN | 0.0 | 1.0 | 0.0 | 3.0 | 2.0 | The “Xuan-Yuan Sword” is an epic oriental RPG ... |
Previously we had 116412 missing values in the rating column.
Now we have 79226. That is a step in the right direction, but I am still not sure it is enough. The remaining gaps are possibly games that did not receive a single score.
Example:
What scores_df contains for this game's ratings:
id,score,amount
1130982,0.5,0
1130982,1.0,0
1130982,1.5,0
1130982,2.0,0
1130982,2.5,0
1130982,3.0,0
1130982,3.5,0
1130982,4.0,0
1130982,4.5,0
1130982,5.0,0
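To check how many of the remaining NaN ratings come from games like this one, the votes per game can be summed. This is only a sketch on a made-up `toy_scores` frame with the same three columns as scores_df.

```python
import pandas as pd

# Toy stand-in for scores_df: game 1 has votes, game 2 has none
toy_scores = pd.DataFrame({
    "id":     [1, 1, 2, 2],
    "score":  [4.0, 5.0, 4.0, 5.0],
    "amount": [3, 1, 0, 0],
})

# Total number of votes per game
votes_per_game = toy_scores.groupby("id")["amount"].sum()

# Games without a single vote: their weighted average is 0/0, which stays NaN
ids_without_votes = votes_per_game[votes_per_game == 0].index.tolist()
print(ids_without_votes)
```

For such games there is nothing to average, so the rating cannot be filled in this way.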
new_combined_games = games_df.merge(genres_df_encoded, on='id', how='inner')
new_combined_games.head(15)
| id | name | date | rating | reviews | plays | playing | backlogs | wishlists | description | ... | RPG | Racing | Real Time Strategy | Shooter | Simulator | Sport | Strategy | Tactical | Turn Based Strategy | Visual Novel | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | Cathode Ray Tube Amusement Device | 1947-12-31 | 3.5 | 65.0 | 117.0 | 1.0 | 28.0 | 56.0 | The cathode ray tube amusement device is the e... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1000002 | Bertie the Brain | 1950-08-25 | 2.5 | 11.0 | 24.0 | 0.0 | 6.0 | 12.0 | Currently considered the first videogame in hi... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 1000003 | Nim | 1951-12-31 | 1.8 | 2.0 | 11.0 | 0.0 | 2.0 | 6.0 | The Nimrod was a special purpose computer that... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1000004 | Draughts | 1952-08-31 | 2.4 | 3.0 | 17.0 | 0.0 | 3.0 | 7.0 | A game of draughts (a.k.a. checkers) written f... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1000005 | OXO | 1952-12-31 | 3.1 | 14.0 | 52.0 | 1.0 | 12.0 | 13.0 | OXO was a computer game developed by Alexander... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 1000006 | Pool | 1954-06-26 | 3.0 | 5.0 | 20.0 | 0.0 | 2.0 | 4.0 | A game of pool (billiards) developed by Willia... | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 1000007 | Tennis for Two | 1958-10-18 | 3.0 | 41.0 | 100.0 | 0.0 | 18.0 | 29.0 | Tennis for Two is often credited to be the wor... | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | 1000008 | Mouse in the Maze | 1959-01-16 | 2.6 | 3.0 | 17.0 | 0.0 | 2.0 | 6.0 | A game where players place maze walls, bits of... | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | 1000009 | Spacewar! | 1962-04-30 | 3.0 | 25.0 | 124.0 | 0.0 | 23.0 | 36.0 | Spacewar! is one of the earliest digital compu... | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9 | 1000010 | The Sumerian Game | 1964-12-31 | 2.6 | 3.0 | 17.0 | 0.0 | 7.0 | 7.0 | The Sumerian Game is a text-based strategy vid... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 10 | 1000011 | Periscope | 1965-12-31 | 2.1 | 2.0 | 20.0 | 0.0 | 7.0 | 11.0 | The electro-mechanical game was released in th... | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | 1000012 | Hamurabi | 1968-12-31 | 2.4 | 14.0 | 46.0 | 0.0 | 4.0 | 11.0 | Hamurabi is a text-based game of land and reso... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 12 | 1000013 | Civil War | 1968-12-31 | 1.8 | 3.0 | 14.0 | 0.0 | 2.0 | 2.0 | A turn-based, strategic simulation of fourteen... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 13 | 1000014 | Indy 500 | 1969-03-31 | 1.0 | 0.0 | 6.0 | 0.0 | 5.0 | 3.0 | A first-person arcade racing game released by ... | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 1000015 | Yakyuuken | 1969-04-27 | NaN | 1.0 | 3.0 | 0.0 | 3.0 | 2.0 | One of the very first erotic video games ever ... | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15 rows × 33 columns
column_names = new_combined_games.columns
column_names
Index(['id', 'name', 'date', 'rating', 'reviews', 'plays', 'playing',
'backlogs', 'wishlists', 'description', 'Adventure', 'Arcade',
'Brawler', 'Card & Board Game', 'Fighting', 'Indie', 'MOBA', 'Music',
'Pinball', 'Platform', 'Point-and-Click', 'Puzzle', 'Quiz/Trivia',
'RPG', 'Racing', 'Real Time Strategy', 'Shooter', 'Simulator', 'Sport',
'Strategy', 'Tactical', 'Turn Based Strategy', 'Visual Novel'],
dtype='object')
new_combined_games.isna().sum()
id 0
name 0
date 22013
rating 63288
reviews 0
plays 0
playing 0
backlogs 0
wishlists 0
description 7694
Adventure 0
Arcade 0
Brawler 0
Card & Board Game 0
Fighting 0
Indie 0
MOBA 0
Music 0
Pinball 0
Platform 0
Point-and-Click 0
Puzzle 0
Quiz/Trivia 0
RPG 0
Racing 0
Real Time Strategy 0
Shooter 0
Simulator 0
Sport 0
Strategy 0
Tactical 0
Turn Based Strategy 0
Visual Novel 0
dtype: int64
It seems that some rows were dropped by the inner merge (games with no match in genres_df_encoded), so the counts of missing values went down as well
old results:
id 0
name 0
date 34623
rating 79226
reviews 0
plays 0
playing 0
backlogs 0
wishlists 0
description 18752
new results:
id 0
name 0
date 22013
rating 63288
reviews 0
plays 0
playing 0
backlogs 0
wishlists 0
description 7694
Adventure 0
Arcade 0
Brawler 0
Card & Board Game 0
Fighting 0
Indie 0
MOBA 0
Music 0
Pinball 0
Platform 0
Point-and-Click 0
Puzzle 0
Quiz/Trivia 0
RPG 0
Racing 0
Real Time Strategy 0
Shooter 0
Simulator 0
Sport 0
Strategy 0
Tactical 0
Turn Based Strategy 0
Visual Novel 0
new_combined_games.shape
(146872, 33)
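To see which games an inner merge removes, one option is a left merge with `indicator=True`, which adds a `_merge` column saying where each row was found. The frames below are made-up stand-ins for games_df and genres_df_encoded, so this is a sketch and not a result from the real data.

```python
import pandas as pd

# Toy stand-ins for games_df and genres_df_encoded (illustrative values)
toy_games = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
toy_genres = pd.DataFrame({"id": [1, 3], "Puzzle": [1, 0]})

# A left merge keeps every game; _merge records where each row was found
checked = toy_games.merge(toy_genres, on="id", how="left", indicator=True)

# Games that an inner merge would have dropped
dropped = checked.loc[checked["_merge"] == "left_only", "name"].tolist()
print(dropped)
```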
Let's try some modeling¶
For this we are going to use Jaccard similarity. It compares two sets of data and returns a value between 0 and 1, computed as the size of their intersection divided by the size of their union: the closer the value is to 1, the more similar the two sets are.
https://medium.com/@mayurdhvajsinhjadeja/jaccard-similarity-34e2c15fb524
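A tiny illustration of the idea in plain Python, treating each game as the set of its genres. The two games and the helper name `jaccard_similarity` are made up for the example.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:          # two empty sets: define the similarity as 0
        return 0.0
    return len(a & b) / len(union)

# Two hypothetical games described by their genres
game_a = {"Puzzle", "Strategy"}
game_b = {"Strategy", "Tactical"}

# One shared genre out of three distinct genres
similarity = jaccard_similarity(game_a, game_b)
print(similarity)
```

With the one-hot encoded genre columns the calculation is the same: the intersection counts the columns where both games have a 1, the union counts the columns where at least one of them has a 1.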
# List of genre columns
genre_columns = [
'Adventure', 'Arcade', 'Brawler', 'Card & Board Game', 'Fighting', 'Indie', 'MOBA',
'Music', 'Pinball', 'Platform', 'Point-and-Click', 'Puzzle', 'Quiz/Trivia', 'RPG',
'Racing', 'Real Time Strategy', 'Shooter', 'Simulator', 'Sport', 'Strategy', 'Tactical',
'Turn Based Strategy', 'Visual Novel'
]
# Create a list of columns to keep
columns_to_keep = ['name'] + genre_columns
# Create a copy of the DataFrame with only these columns
jaccard_df = new_combined_games[columns_to_keep].copy()
jaccard_df.head(15)
| name | Adventure | Arcade | Brawler | Card & Board Game | Fighting | Indie | MOBA | Music | Pinball | ... | RPG | Racing | Real Time Strategy | Shooter | Simulator | Sport | Strategy | Tactical | Turn Based Strategy | Visual Novel | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cathode Ray Tube Amusement Device | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | Bertie the Brain | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | Nim | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | Draughts | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | OXO | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | Pool | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | Tennis for Two | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | Mouse in the Maze | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 8 | Spacewar! | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9 | The Sumerian Game | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 10 | Periscope | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | Hamurabi | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 12 | Civil War | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 13 | Indy 500 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | Yakyuuken | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15 rows × 24 columns
from sklearn.metrics import jaccard_score
from scipy.spatial.distance import pdist, squareform
from sklearnex import patch_sklearn
# Patch scikit-learn with the Intel extension (this does not change scipy's pdist)
patch_sklearn()
# Exclude the 'name' column when calculating the Jaccard distance
jaccard_distance = pdist(jaccard_df.drop(columns='name').values, metric='jaccard')
print(jaccard_distance)
Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Cell In[91], line 4
      2 from sklearnex import patch_sklearn
      3 patch_sklearn()
----> 4 jaccard_distance = pdist(jaccard_df.drop(columns='name').values, metric='jaccard')
      5 print(jaccard_distance)
File c:\Users\kenar\AppData\Local\Programs\Python\Python311\Lib\site-packages\scipy\spatial\distance.py:2232, in pdist(X, metric, out, **kwargs)
   2230 if metric_info is not None:
   2231     pdist_fn = metric_info.pdist_func
-> 2232     return pdist_fn(X, out=out, **kwargs)
   2233 elif mstr.startswith("test_"):
   2234     metric_info = _TEST_METRICS.get(mstr, None)
MemoryError: Unable to allocate 80.4 GiB for an array with shape (10785618756,) and data type float64
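The size in the error message can be checked by hand. pdist returns a condensed vector with one float64 entry per unordered pair of rows, so n rows need n(n-1)/2 entries of 8 bytes each. The short sketch below redoes that arithmetic, assuming the row count of 146872 that pandas reports further down ("indices imply (146872, 146872)"); the variable names are only illustrative.

```python
# Rough memory estimate for the condensed distance vector returned by pdist.
# Assumption: the DataFrame has 146872 rows, as reported in the later ValueError.
n_games = 146_872

# pdist stores one value per unordered pair of rows.
n_pairs = n_games * (n_games - 1) // 2

# Each value is a float64, i.e. 8 bytes.
bytes_needed = n_pairs * 8
gib_needed = bytes_needed / 2**30

print("pairs:", n_pairs)
print("GiB needed:", round(gib_needed, 1))
```

The pair count matches the array shape in the traceback, which suggests the error comes from the number of games rather than from the number of genre columns.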
square_jaccard_distances = squareform(jaccard_distance)
print(square_jaccard_distances)
[[0. 1.]
 [1. 0.]]
Because we used pdist, the result is a distance: it measures how different two games are. We want the opposite, so we convert it to a similarity by subtracting each distance from 1.
jaccard_similiarity_array = 1 - square_jaccard_distances
print(jaccard_similiarity_array)
[[1. 0.]
 [0. 1.]]
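To make the relation between Jaccard distance and similarity concrete, here is a small sketch on made-up data. The three rows and their genre flags are hypothetical and are not taken from the dataset; only pdist and squareform are the same calls used above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical genre flags for three games (columns could be e.g. Shooter, Strategy, Sport).
toy_genres = np.array([
    [1, 1, 0],  # game 0: two genres
    [1, 0, 0],  # game 1: shares one genre with game 0
    [0, 0, 1],  # game 2: shares nothing with the others
], dtype=bool)

# Jaccard distance = 1 - (shared genres / genres present in either game).
toy_distances = squareform(pdist(toy_genres, metric='jaccard'))

# Similarity is the complement of the distance.
toy_similarities = 1 - toy_distances

print(toy_distances)
print(toy_similarities)
```

Games 0 and 1 share one of the two genres they have between them, so their distance and their similarity are both 0.5, while game 2 has similarity 0 to both.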
distance_df = pd.DataFrame(jaccard_similiarity_array, index = jaccard_df['name'], columns=jaccard_df['name'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[83], line 1
----> 1 distance_df = pd.DataFrame(jaccard_similiarity_array, index = jaccard_df['name'], columns=jaccard_df['name'])
File c:\Users\kenar\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\frame.py:827, in DataFrame.__init__(self, data, index, columns, dtype, copy)
    816     mgr = dict_to_mgr(
    817         # error: Item "ndarray" of "Union[ndarray, Series, Index]" has no
    818         # attribute "name"
   (...)
    824         copy=_copy,
    825     )
    826 else:
--> 827     mgr = ndarray_to_mgr(
    828         data,
    829         index,
    830         columns,
    831         dtype=dtype,
    832         copy=copy,
    833         typ=manager,
    834     )
    836 # For data is list-like, or Iterable (will consume into list)
    837 elif is_list_like(data):
File c:\Users\kenar\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\construction.py:336, in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    331 # _prep_ndarraylike ensures that values.ndim == 2 at this point
    332 index, columns = _get_axes(
    333     values.shape[0], values.shape[1], index=index, columns=columns
    334 )
--> 336 _check_values_indices_shape_match(values, index, columns)
    338 if typ == "array":
    339     if issubclass(values.dtype.type, str):
File c:\Users\kenar\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\internals\construction.py:420, in _check_values_indices_shape_match(values, index, columns)
    418 passed = values.shape
    419 implied = (len(index), len(columns))
--> 420 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (2, 2), indices imply (146872, 146872)
distance_df.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[92], line 1
----> 1 distance_df.head()
NameError: name 'distance_df' is not defined
The next steps to consider are:
1. Continue working on the model so that it shows similar games
2. Possibly add the other features, such as 'plays' and 'wishlists'
3. Find a way to work around the occasional memory errors
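One possible way around the memory error in step 3 is to skip the full pairwise matrix and compare only the selected game against all the others, which needs one value per game instead of one per pair. The block below is a minimal sketch under assumptions, not the notebook's implementation: the function name top_similar_games and the toy DataFrame are hypothetical, and games without any genre are given similarity 0 by choice, which may differ from how scipy treats two all-zero rows.

```python
import numpy as np
import pandas as pd

def top_similar_games(df, game_name, genre_cols, top_n=5):
    """Return the top_n games most similar to game_name by Jaccard similarity.

    Hypothetical helper: compares one game against all rows, so memory use
    grows linearly with the number of games.
    """
    flags = df[genre_cols].to_numpy(dtype=np.int32)
    names = df['name'].to_numpy()

    matches = np.flatnonzero(names == game_name)
    if matches.size == 0:
        raise KeyError(f"Game not found: {game_name}")
    query_idx = matches[0]
    query = flags[query_idx]

    # Genres shared with the query, and genres present in either game.
    intersection = flags @ query
    union = flags.sum(axis=1) + query.sum() - intersection

    # Jaccard similarity; rows with an empty union get 0 (a design choice).
    similarity = np.divide(
        intersection, union,
        out=np.zeros(len(flags), dtype=float),
        where=union > 0,
    )

    # Exclude the query game itself from the ranking.
    similarity[query_idx] = -1.0
    order = np.argsort(-similarity, kind='stable')[:top_n]
    return pd.Series(similarity[order], index=names[order], name='jaccard_similarity')

# Toy data with made-up games, only to show the call.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'Shooter': [1, 1, 0, 0],
    'Strategy': [1, 0, 1, 0],
    'Sport': [0, 0, 1, 0],
})
result = top_similar_games(toy, 'A', ['Shooter', 'Strategy', 'Sport'], top_n=2)
print(result)
```

With the real data, the call would presumably look like top_similar_games(jaccard_df, some_game_name, genre_columns). Ties between games with identical genre sets are broken by row order, so adding features such as 'plays' or 'wishlists' could later help to rank within those ties.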